-
Notifications
You must be signed in to change notification settings - Fork 97
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Implement pre-fetching in map() and gen() #521
Conversation
Deploying datachain-documentation with Cloudflare Pages
|
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #521 +/- ##
==========================================
+ Coverage 87.92% 88.04% +0.11%
==========================================
Files 102 102
Lines 10044 10085 +41
Branches 1363 1373 +10
==========================================
+ Hits 8831 8879 +48
+ Misses 871 865 -6
+ Partials 342 341 -1
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Sentry. 🚨 Try these New Features:
|
@@ -325,6 +325,7 @@ def settings( | |||
parallel=None, | |||
workers=None, | |||
min_task_size=None, | |||
prefetch: Optional[int] = None, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
q: why int? let's update the docs here (do we have some CI to detect these discrepancies btw (missing docs) cc @skshetry )
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks great. A few question.
One more general question.
Does this implementation mean that we now won't start (at least the very first row) UDF until file is fetched? Before it was doing this "on-demand" I guess - when file is needed. I wonder how big of an issue it can be in certain scenarios (and especially if we decide to do prefetch for batches (agg, batch mapper)).
Yes, the File object is only passed to the UDF once prefetching is complete, but note that in this PR you need to specify |
ef79f67
to
9fd3155
Compare
# Ensure we're using a thread-local connection | ||
with self.clone() as wh: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess this clone()
should be outside the loop? Otherwise, we'll not be reusing the connection.
This adds a
prefetch
setting which enables async downloading of objects to the cache before running a generator or mapper UDF (see #40). The default is to use 2 workers, but it can be disabled using.setting(prefetch=0)
. Note that it has no effect if caching isn't enabled (caching is disabled by default).In order for this to work,
AbstractWarehouse.dataset_select_paginated()
is now required to be thread-safe, so query result pages are now buffered as a list in that function.